Adjusted Mutual Information (AMI) — adjusted_mutual_info_score#

Adjusted Mutual Information (AMI) compares two cluster labelings (two partitions of the same n samples) and corrects Mutual Information (MI) for agreement expected by chance.

AMI is an external clustering metric: it needs two labelings (e.g., ground-truth classes vs. predicted clusters, or the outputs of two different clustering algorithms).

Goals

  • Build intuition: contingency tables → MI → chance correction → AMI.

  • Implement AMI from scratch in NumPy (including Expected MI).

  • Visualize how AMI behaves (permutation invariance, noise, number of clusters).

  • Use AMI as a black-box objective to tune a simple model (logistic regression threshold / regularization).

Imports and setup

from sklearn.metrics import adjusted_mutual_info_score
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from scipy.special import gammaln

from sklearn.datasets import make_blobs, make_classification
from sklearn.metrics import (
    accuracy_score,
    adjusted_mutual_info_score as skl_adjusted_mutual_info_score,
)
from sklearn.model_selection import train_test_split

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(0)

1) The core object: the contingency table#

Let two labelings of the same $n$ samples be:

  • True partition: $U = \{U_1,\dots,U_R\}$

  • Predicted partition: $V = \{V_1,\dots,V_C\}$

Define the contingency table $N\in\mathbb{N}^{R\times C}$ by

$$N_{ij} = |U_i \cap V_j|.$$

Row and column sums are

$$a_i = \sum_j N_{ij} = |U_i|,\qquad b_j = \sum_i N_{ij} = |V_j|,\qquad n = \sum_i a_i = \sum_j b_j.$$

From counts we can define empirical probabilities:

$$p_{ij} = \frac{N_{ij}}{n},\qquad p_{i\cdot} = \frac{a_i}{n},\qquad p_{\cdot j} = \frac{b_j}{n}.$$

AMI is computed entirely from $N$ (it ignores feature geometry).

def _relabel_to_integers(labels: np.ndarray) -> np.ndarray:
    labels = np.asarray(labels)
    _, relabeled = np.unique(labels, return_inverse=True)
    return relabeled


def contingency_matrix_np(labels_true: np.ndarray, labels_pred: np.ndarray) -> np.ndarray:
    # Contingency table N_ij = |U_i ∩ V_j| for two labelings.
    true_int = _relabel_to_integers(labels_true)
    pred_int = _relabel_to_integers(labels_pred)

    n_true = int(true_int.max()) + 1
    n_pred = int(pred_int.max()) + 1

    contingency = np.zeros((n_true, n_pred), dtype=np.int64)
    np.add.at(contingency, (true_int, pred_int), 1)
    return contingency
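
As a quick sanity check of the construction, here is the same `np.add.at` trick applied to a tiny hand-checkable pair of labelings (toy values chosen for illustration):

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# N_ij = |U_i ∩ V_j|: rows index true clusters, columns predicted clusters.
N = np.zeros((3, 3), dtype=np.int64)
np.add.at(N, (y_true, y_pred), 1)

print(N)
print(N.sum(axis=1))  # a_i = |U_i| -> [3 2 1]
print(N.sum(axis=0))  # b_j = |V_j| -> [2 3 1]
```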

2) Mutual Information (MI)#

The mutual information between partitions $U$ and $V$ is the mutual information between the discrete random variables “which cluster am I in?” under each labeling:

$$\mathrm{MI}(U,V) = \sum_{i=1}^{R}\sum_{j=1}^{C} p_{ij}\,\log\frac{p_{ij}}{p_{i\cdot}\,p_{\cdot j}}.$$

Equivalently, using counts (summing over cells with $N_{ij} > 0$):

$$\mathrm{MI}(U,V) = \sum_{i,j:\, N_{ij}>0} \frac{N_{ij}}{n}\,\log\left(\frac{n\,N_{ij}}{a_i\,b_j}\right).$$

The entropies are

$$H(U) = -\sum_i p_{i\cdot}\log p_{i\cdot},\qquad H(V) = -\sum_j p_{\cdot j}\log p_{\cdot j},$$

and $\mathrm{MI}(U,V) = H(U) + H(V) - H(U,V)$.

Units: with natural logs, MI is in nats (use $\log_2$ for bits).

Why MI needs adjustment#

MI tends to increase when you use more clusters (more degrees of freedom), even for unrelated/random labelings. AMI fixes that by subtracting the MI you’d expect by chance given the cluster sizes.

def entropy_from_counts(counts: np.ndarray) -> float:
    counts = np.asarray(counts, dtype=np.float64)
    n = counts.sum()
    if n == 0:
        return 0.0

    p = counts[counts > 0] / n
    return float(-np.sum(p * np.log(p)))


def mutual_info_from_contingency(contingency: np.ndarray) -> float:
    # Mutual information in nats, computed from a contingency table.
    contingency = np.asarray(contingency, dtype=np.int64)
    n = int(contingency.sum())
    if n == 0:
        return 0.0

    row_sums = contingency.sum(axis=1)
    col_sums = contingency.sum(axis=0)

    row_idx, col_idx = np.nonzero(contingency)
    nij = contingency[row_idx, col_idx].astype(np.float64)

    # MI = sum (nij/n) * log( (n*nij)/(ai*bj) )
    numerator = nij * n
    denominator = row_sums[row_idx] * col_sums[col_idx]

    return float(np.sum((nij / n) * np.log(numerator / denominator)))
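
The entropy identity gives an independent way to cross-check `mutual_info_from_contingency`; a minimal example on a hand-picked toy table (for this particular table the MI works out to exactly $\log 2$):

```python
import numpy as np

N = np.array([[2, 1, 0],
              [0, 2, 0],
              [0, 0, 1]], dtype=np.float64)
n = N.sum()

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

h_u = entropy(N.sum(axis=1) / n)   # H(U) from row marginals
h_v = entropy(N.sum(axis=0) / n)   # H(V) from column marginals
h_uv = entropy((N / n).ravel())    # joint entropy H(U, V)

mi = h_u + h_v - h_uv
print(mi)  # ≈ 0.6931 = log(2) for this table
```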

3) Chance correction: Expected Mutual Information (EMI)#

AMI uses the expected MI under the null model:

  • the two labelings are random

  • but their cluster size distributions ({a_i}) and ({b_j}) are fixed (same marginals as observed)

This is the “generalized hypergeometric” model.

For a fixed pair of clusters $(i,j)$, the intersection size $K = N_{ij}$ follows a hypergeometric distribution with population size $n$, $a_i$ marked items, and $b_j$ draws, so

$$\mathbb{P}(K=k) = \frac{\binom{a_i}{k}\binom{n-a_i}{b_j-k}}{\binom{n}{b_j}},\qquad k\in\big[\max(0,\,a_i+b_j-n),\ \min(a_i,b_j)\big].$$

Then

$$\mathrm{EMI} = \mathbb{E}[\mathrm{MI}(U,V)] = \sum_{i=1}^{R}\sum_{j=1}^{C}\sum_{k} \frac{k}{n}\,\log\!\left(\frac{n\,k}{a_i\,b_j}\right)\mathbb{P}(K=k).$$

This looks heavy, but for typical cluster counts it is feasible: the inner sum only ranges up to $\min(a_i,b_j)$.

def expected_mutual_info_from_marginals(row_sums: np.ndarray, col_sums: np.ndarray) -> float:
    # Expected mutual information under the hypergeometric null (fixed marginals).
    row_sums = np.asarray(row_sums, dtype=np.int64)
    col_sums = np.asarray(col_sums, dtype=np.int64)

    n = int(row_sums.sum())
    if n == 0:
        return 0.0

    # log-factorials for stable combinatorics: log(n!) = gammaln(n+1)
    log_factorial = gammaln(np.arange(n + 1, dtype=np.float64) + 1.0)

    def log_comb(n_: int, k_: np.ndarray | int) -> np.ndarray:
        k_arr = np.asarray(k_, dtype=np.int64)
        return log_factorial[n_] - log_factorial[k_arr] - log_factorial[n_ - k_arr]

    expected_mi = 0.0

    for a in row_sums:
        for b in col_sums:
            k_min = max(0, a + b - n)
            k_max = min(a, b)
            if k_max < k_min:
                continue

            # Include k=0 in the probability normalization. (Its contribution to MI is 0.)
            k = np.arange(k_min, k_max + 1, dtype=np.int64)

            # Hypergeometric PMF in log-space:
            # P(K=k) = C(a,k) * C(n-a, b-k) / C(n,b)
            log_p = log_comb(a, k) + log_comb(n - a, b - k) - log_comb(n, b)

            # Normalize to avoid under/overflow in exp.
            m = float(log_p.max())
            p = np.exp(log_p - m)
            p = p / p.sum()

            mask = k > 0
            if not np.any(mask):
                continue

            k_pos = k[mask]
            term = (k_pos / n) * np.log((n * k_pos) / (a * b))
            expected_mi += float(np.sum(p[mask] * term))

    return float(expected_mi)
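
One way to sanity-check the EMI formula independently of the implementation above is Monte Carlo: permute one labeling uniformly at random (which fixes both cluster-size vectors) and average the resulting MI. A small self-contained sketch with assumed toy marginals:

```python
import numpy as np
from math import comb, log

n, a, b = 6, [3, 3], [4, 2]  # toy marginals (assumed for illustration)

# Exact EMI via the hypergeometric sum (k = 0 terms contribute nothing).
emi = 0.0
for ai in a:
    for bj in b:
        for k in range(max(1, ai + bj - n), min(ai, bj) + 1):
            p = comb(ai, k) * comb(n - ai, bj - k) / comb(n, bj)
            emi += p * (k / n) * log(n * k / (ai * bj))

# Monte Carlo estimate: average MI over random permutations of one labeling.
def mi(u, v):
    N = np.zeros((2, 2))
    np.add.at(N, (u, v), 1)
    total = 0.0
    for i in range(2):
        for j in range(2):
            if N[i, j] > 0:
                total += (N[i, j] / n) * log(n * N[i, j] / (N[i].sum() * N[:, j].sum()))
    return total

rng = np.random.default_rng(0)
y_u = np.repeat([0, 1], a)
y_v = np.repeat([0, 1], b)
mc = float(np.mean([mi(y_u, rng.permutation(y_v)) for _ in range(20000)]))

print(emi, mc)  # the two estimates should agree closely
```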


def adjusted_mutual_info_score_np(
    labels_true: np.ndarray,
    labels_pred: np.ndarray,
    *,
    average_method: str = "arithmetic",
) -> float:
    # Adjusted mutual information (AMI) in NumPy (+ gammaln for stability).
    contingency = contingency_matrix_np(labels_true, labels_pred)
    n = int(contingency.sum())
    if n == 0:
        return 1.0

    row_sums = contingency.sum(axis=1)
    col_sums = contingency.sum(axis=0)

    h_true = entropy_from_counts(row_sums)
    h_pred = entropy_from_counts(col_sums)

    # Degenerate case: both partitions put everything in one cluster.
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0

    mi = mutual_info_from_contingency(contingency)
    emi = expected_mutual_info_from_marginals(row_sums, col_sums)

    if average_method == "arithmetic":
        normalizer = 0.5 * (h_true + h_pred)
    elif average_method == "min":
        normalizer = min(h_true, h_pred)
    elif average_method == "max":
        normalizer = max(h_true, h_pred)
    elif average_method == "geometric":
        normalizer = float(np.sqrt(h_true * h_pred))
    else:
        raise ValueError(
            "average_method must be one of: 'arithmetic', 'min', 'max', 'geometric'"
        )

    denom = normalizer - emi
    if denom <= np.finfo(float).eps:
        return 0.0

    return float((mi - emi) / denom)

4) AMI definition (and why it’s useful)#

Given MI and its expectation under chance, AMI is

$$\mathrm{AMI}(U,V) = \frac{\mathrm{MI}(U,V) - \mathrm{EMI}}{\mathrm{Normalizer} - \mathrm{EMI}}.$$

In scikit-learn, the normalizer is built from $H(U)$ and $H(V)$ via average_method:

  • arithmetic: $\frac{H(U) + H(V)}{2}$ (default)

  • min: $\min(H(U), H(V))$

  • max: $\max(H(U), H(V))$

  • geometric: $\sqrt{H(U)\,H(V)}$
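
When $H(U) \neq H(V)$ the four normalizers give different scores (smaller normalizer, larger AMI). A quick comparison using scikit-learn on toy labels (chosen for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])

ami_by_method = {
    m: adjusted_mutual_info_score(y_true, y_pred, average_method=m)
    for m in ["min", "geometric", "arithmetic", "max"]
}
# Normalizers satisfy min <= geometric <= arithmetic <= max, so (for MI > EMI)
# the AMI values come out in the opposite order.
print(ami_by_method)
```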

Key properties

  • Permutation invariant: renaming cluster IDs doesn’t change AMI.

  • Chance-corrected: unrelated random partitions have AMI ≈ 0.

  • Comparable across different numbers of clusters (unlike raw MI).

  • Can be negative if agreement is worse than chance (under this null model).
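
Two of these properties are easy to check directly with scikit-learn (random labels and seed assumed for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=60)
y_pred = rng.integers(0, 3, size=60)   # unrelated to y_true

# Permutation invariance: cyclically renaming predicted IDs changes nothing.
ami_orig = adjusted_mutual_info_score(y_true, y_pred)
ami_renamed = adjusted_mutual_info_score(y_true, (y_pred + 1) % 3)
print(ami_orig, ami_renamed)  # equal up to floating point

# Chance correction: unrelated labels score near 0 (possibly slightly negative).
print(ami_orig)
```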

5) Sanity check vs scikit-learn#

Below we compare our NumPy implementation to sklearn.metrics.adjusted_mutual_info_score on random labelings.

def _assert_matches_sklearn(n_trials: int = 200) -> None:
    for average_method in ["arithmetic", "min", "max", "geometric"]:
        for n in [10, 50, 200]:
            for _ in range(n_trials):
                y_true = rng.integers(0, 6, size=n)
                y_pred = rng.integers(0, 8, size=n)

                ours = adjusted_mutual_info_score_np(
                    y_true, y_pred, average_method=average_method
                )
                skl = skl_adjusted_mutual_info_score(
                    y_true, y_pred, average_method=average_method
                )
                if not np.isclose(ours, skl, atol=1e-10, rtol=0.0):
                    raise AssertionError(
                        f"Mismatch (n={n}, average_method={average_method}): "
                        f"ours={ours}, sklearn={skl}"
                    )


_assert_matches_sklearn(n_trials=50)

6) Toy examples: permutation invariance and errors#

Let’s build three predicted labelings from the same ground truth:

  1. Perfect match up to a label permutation

  2. A few mistakes (some points moved to the wrong cluster)

  3. Completely random labels

We’ll look at the contingency tables and AMI values.

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Perfect up to relabeling: 0→2, 1→0, 2→1
perm = {0: 2, 1: 0, 2: 1}
y_pred_permuted = np.array([perm[v] for v in y_true])

# A few mistakes
y_pred_noisy = y_pred_permuted.copy()
y_pred_noisy[[1, 4, 7]] = [0, 1, 2]

# Random labels with 3 clusters
rng_local = np.random.default_rng(1)
y_pred_random = rng_local.integers(0, 3, size=y_true.size)

examples = {
    "perfect (permuted labels)": y_pred_permuted,
    "noisy": y_pred_noisy,
    "random": y_pred_random,
}

scores = {
    name: adjusted_mutual_info_score_np(y_true, y_pred)
    for name, y_pred in examples.items()
}

scores
{'perfect (permuted labels)': 1.0,
 'noisy': 0.16502260912888653,
 'random': 0.12559873432804056}

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=[
        f"{name}<br>AMI={scores[name]:.3f}" for name in examples.keys()
    ],
)

for col, (name, y_pred) in enumerate(examples.items(), start=1):
    cont = contingency_matrix_np(y_true, y_pred)
    fig.add_trace(
        go.Heatmap(
            z=cont,
            coloraxis="coloraxis",
            text=cont,
            texttemplate="%{text}",
            x=[f"pred {j}" for j in range(cont.shape[1])],
            y=[f"true {i}" for i in range(cont.shape[0])],
        ),
        row=1,
        col=col,
    )

fig.update_layout(
    height=320,
    width=950,
    title_text="Contingency tables (counts)",
    coloraxis=dict(colorscale="Blues"),
)

fig

7) How AMI behaves with label noise#

We start from the true labels, then randomly “corrupt” a fraction $p$ of labels (reassigning them to a random cluster). AMI should decrease smoothly from near 1 toward ~0.

We also show that AMI stays near 0 for random labels, even if we keep the number of clusters fixed.

X, y_true = make_blobs(
    n_samples=400,
    centers=4,
    cluster_std=1.2,
    random_state=0,
)

n = y_true.size
k_true = len(np.unique(y_true))

noise_grid = np.linspace(0.0, 0.9, 19)

amis = []
for p in noise_grid:
    y_pred = y_true.copy()
    n_flip = int(round(p * n))
    flip_idx = rng.choice(n, size=n_flip, replace=False)

    # Random reassignment (may coincidentally keep the same label for some points)
    y_pred[flip_idx] = rng.integers(0, k_true, size=n_flip)

    amis.append(adjusted_mutual_info_score_np(y_true, y_pred))

# Baseline: fully random labels with the same number of clusters
n_trials = 80
ami_random = np.array(
    [
        adjusted_mutual_info_score_np(y_true, rng.integers(0, k_true, size=n))
        for _ in range(n_trials)
    ]
)

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=noise_grid,
        y=amis,
        mode="lines+markers",
        name="AMI (corrupted labels)",
    )
)
fig.add_trace(
    go.Scatter(
        x=[noise_grid.min(), noise_grid.max()],
        y=[ami_random.mean(), ami_random.mean()],
        mode="lines",
        name=f"AMI mean (random labels, k={k_true})",
        line=dict(dash="dash"),
    )
)
fig.add_trace(
    go.Scatter(
        x=[noise_grid.min(), noise_grid.max()],
        y=[np.quantile(ami_random, 0.1), np.quantile(ami_random, 0.1)],
        mode="lines",
        line=dict(width=0),
        showlegend=False,
    )
)
fig.add_trace(
    go.Scatter(
        x=[noise_grid.min(), noise_grid.max()],
        y=[np.quantile(ami_random, 0.9), np.quantile(ami_random, 0.9)],
        mode="lines",
        fill="tonexty",
        line=dict(width=0),
        name="random AMI 10%–90% band",
    )
)

fig.update_layout(
    title="AMI vs label noise",
    xaxis_title="fraction of labels corrupted",
    yaxis_title="adjusted mutual information (AMI)",
    height=420,
)

fig

8) Why the adjustment matters: number of clusters bias#

Raw MI (and many related measures) can grow when the predicted labeling uses more clusters, even if it’s random.

Below we sample random predicted labels with different k and plot both MI and AMI. If the labels are unrelated to the ground truth, AMI should stay near 0 regardless of k, while MI tends to drift upward.

k_grid = np.arange(2, 31)

mi_by_k = []
ami_by_k = []
for k in k_grid:
    y_pred = rng.integers(0, k, size=n)
    cont = contingency_matrix_np(y_true, y_pred)

    mi_by_k.append(mutual_info_from_contingency(cont))
    ami_by_k.append(adjusted_mutual_info_score_np(y_true, y_pred))

fig = make_subplots(rows=1, cols=1, specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=k_grid, y=mi_by_k, mode="lines+markers", name="MI (nats)"), secondary_y=False)
fig.add_trace(go.Scatter(x=k_grid, y=ami_by_k, mode="lines+markers", name="AMI"), secondary_y=True)

fig.update_layout(
    title="Random labels: MI increases with k, AMI stays near 0",
    xaxis_title="k (number of predicted clusters)",
    height=430,
)
fig.update_yaxes(title_text="MI (nats)", secondary_y=False)
fig.update_yaxes(title_text="AMI", secondary_y=True)

fig

9) AMI under chance: distribution#

Even under the null model, AMI fluctuates around 0 for finite n. We can visualize this by sampling many random predicted labelings.

k = k_true
n_trials = 250
ami_samples = np.array(
    [
        adjusted_mutual_info_score_np(y_true, rng.integers(0, k, size=n))
        for _ in range(n_trials)
    ]
)

fig = px.histogram(
    ami_samples,
    nbins=30,
    title=f"AMI distribution for random labels (k={k}, n={n})",
    labels={"value": "AMI"},
)
fig.add_vline(x=float(ami_samples.mean()), line_dash="dash", line_color="black")
fig.update_layout(height=420)

fig

10) Using AMI as an optimization target (logistic regression)#

AMI is defined on discrete labels, so it’s not differentiable with respect to model parameters. That means:

  • You typically don’t train models by gradient descent on AMI directly.

  • You can use AMI as a black-box objective for:

    • choosing thresholds

    • selecting hyperparameters (regularization strength, features, etc.)

    • model selection

Below is a from-scratch logistic regression trained with log-loss, and then tuned by maximizing AMI on a validation set.

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def standardize_fit(X: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def standardize_transform(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (X - mean) / std


def add_intercept(X: np.ndarray) -> np.ndarray:
    return np.c_[np.ones((X.shape[0], 1)), X]


def train_logistic_regression_gd(
    X: np.ndarray,
    y: np.ndarray,
    *,
    lr: float = 0.2,
    n_steps: int = 2000,
    reg_strength: float = 0.0,
) -> np.ndarray:
    # L2-regularized logistic regression with gradient descent.
    # reg_strength corresponds to λ in: loss + (λ/2)||w||^2 (excluding intercept).
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    n_samples, n_features = X.shape
    w = np.zeros(n_features, dtype=np.float64)

    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = (X.T @ (p - y)) / n_samples

        # L2 penalty (do not regularize intercept)
        grad[1:] += reg_strength * w[1:]

        w -= lr * grad

    return w


def predict_proba_logreg(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    return sigmoid(X @ w)

X, y = make_classification(
    n_samples=800,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    class_sep=1.6,
    random_state=0,
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.35, random_state=0, stratify=y
)

mean, std = standardize_fit(X_train)
X_train_std = standardize_transform(X_train, mean, std)
X_val_std = standardize_transform(X_val, mean, std)

X_train_ = add_intercept(X_train_std)
X_val_ = add_intercept(X_val_std)

reg_grid = np.array([0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])
threshold_grid = np.linspace(0.05, 0.95, 37)

ami_grid = np.zeros((reg_grid.size, threshold_grid.size))
acc_grid = np.zeros((reg_grid.size, threshold_grid.size))

for i, reg in enumerate(reg_grid):
    w = train_logistic_regression_gd(
        X_train_, y_train, lr=0.25, n_steps=2500, reg_strength=float(reg)
    )
    p_val = predict_proba_logreg(X_val_, w)

    for j, thr in enumerate(threshold_grid):
        y_hat = (p_val >= thr).astype(int)
        ami_grid[i, j] = adjusted_mutual_info_score_np(y_val, y_hat)
        acc_grid[i, j] = accuracy_score(y_val, y_hat)

best_idx = np.unravel_index(np.argmax(ami_grid), ami_grid.shape)
best_reg = float(reg_grid[best_idx[0]])
best_thr = float(threshold_grid[best_idx[1]])
best_ami = float(ami_grid[best_idx])

(best_reg, best_thr, best_ami)
(0.0, 0.625, 0.9249381865602224)

fig = go.Figure(
    data=go.Heatmap(
        z=ami_grid,
        x=threshold_grid,
        y=[str(v) for v in reg_grid],
        colorscale="Viridis",
        colorbar=dict(title="AMI"),
    )
)

fig.update_layout(
    title=f"Validation AMI for logistic regression (best: λ={best_reg}, thr={best_thr:.2f}, AMI={best_ami:.3f})",
    xaxis_title="threshold",
    yaxis_title="L2 regularization strength λ",
    height=480,
)

fig

# Compare the AMI-optimal threshold to the accuracy-optimal threshold for the best λ

w_best = train_logistic_regression_gd(
    X_train_, y_train, lr=0.25, n_steps=2500, reg_strength=best_reg
)
p_val = predict_proba_logreg(X_val_, w_best)

ami_curve = np.array(
    [
        adjusted_mutual_info_score_np(y_val, (p_val >= t).astype(int))
        for t in threshold_grid
    ]
)
acc_curve = np.array([accuracy_score(y_val, (p_val >= t).astype(int)) for t in threshold_grid])

thr_acc = float(threshold_grid[np.argmax(acc_curve)])

fig = make_subplots(rows=1, cols=1, specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=threshold_grid, y=ami_curve, name="AMI"), secondary_y=False)
fig.add_trace(go.Scatter(x=threshold_grid, y=acc_curve, name="Accuracy"), secondary_y=True)

fig.add_vline(x=best_thr, line_dash="dash", line_color="black")
fig.add_vline(x=thr_acc, line_dash="dot", line_color="gray")

fig.update_layout(
    title=f"Threshold tuning at λ={best_reg} (dash: best AMI, dot: best accuracy)",
    xaxis_title="threshold",
    height=430,
)
fig.update_yaxes(title_text="AMI", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)

fig

11) Pros, cons, and when to use AMI#

Pros

  • Permutation invariant (cluster labels are arbitrary).

  • Corrects for chance: random labelings tend to score near 0.

  • Works when the two labelings have different numbers of clusters.

  • Symmetric: $\mathrm{AMI}(U,V)=\mathrm{AMI}(V,U)$.

Cons / pitfalls

  • Requires two labelings; not an intrinsic clustering quality measure (can’t tell if clusters are “good” in feature space).

  • Based only on the contingency table; ignores geometry (two clusterings with the same contingency table are indistinguishable).

  • Not differentiable → not a natural training loss; typically used for evaluation or black-box tuning.

  • Computing EMI can be more expensive than simpler indices (though usually fine for moderate n and cluster counts).

  • Interpretation is less immediate than, say, accuracy; AMI is best used comparatively.

Good use cases

  • Evaluating clustering against ground truth classes (external validation).

  • Measuring agreement between two clustering algorithms / runs (stability studies).

  • Hyperparameter selection for clustering when a reference labeling is available (e.g., choose k in KMeans by maximizing AMI).
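
As an illustration of the last use case, a minimal sketch that scans k for scikit-learn's KMeans and keeps the k with the highest AMI against the reference labels (dataset and grid parameters are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

X, y_ref = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit KMeans for each candidate k and score against the reference labels.
ami_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ami_by_k[k] = adjusted_mutual_info_score(y_ref, labels)

best_k = max(ami_by_k, key=ami_by_k.get)
print(best_k, round(ami_by_k[best_k], 3))
```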

Exercises#

  1. Modify the noise experiment to keep the cluster sizes fixed while permuting membership. How does AMI behave?

  2. Implement normalized_mutual_info_score from the same building blocks and compare it to AMI across different k.

  3. Use AMI to select k for a simple KMeans implementation on make_blobs (treat the true labels as reference).

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html

  • Vinh, Epps, Bailey (2010): Information Theoretic Measures for Clusterings Comparison